1 Load the data

The high-throughput poly(A) RNA-seq data used in this notebook are described in Neeves et al, Brain (2022). They are derived from nuclear and cytoplasmic fractions of human induced pluripotent stem cells (hiPSC; day 0), neural precursors (NPC; day 3 and day 7), ā€˜patterned’ precursor motor neurons (ventral spinal cord; pMN; day 14), post-mitotic but electrophysiologically immature motor neurons (MN; day 22), and electrophysiologically active MNs (mMNs; day 35).

Schematic depicting the iPSC differentiation strategy for motor neurogenesis. Arrows indicate sampling time-points in days when cells were fractionated into nuclear and cytoplasmic compartments prior to deep (polyA) RNA-sequencing. Four iPSC clones were obtained from four different healthy controls and three iPSC clones from two ALS patients with VCP mutations: R155C and R191Q; hereafter termed VCPmu. Induced-pluripotent stem cells (iPSC); neural precursors (NPC); ā€œpatternedā€ precursor motor neurons (ventral spinal cord; pMN); post-mitotic but electrophysiologically inactive motor neurons (MN); electrophysiologically active MNs (mMN). The gene expression count data was obtained from Kallisto (Bray et al., 2016) using the Gencode hg38 release Homo sapiens transcriptome. The quantification of alternative splicing was performed using VAST-TOOLS (Tapial et al., 2017). The data required for this practical session can be downloaded from Zenodo (please note this is a different version of the one in the previous module).

load("./AS_GE_data.RData")

#Data: 
# info                    : data-frame of information for the nuclear samples
# myE_gen                 : noramlised gene expression count matrix of the CTRL nuclear fraction quantile-normalised
# dat_diff_time_nuc_ctrl  : data-frame with different AS analysis for the nuclear compartment
# dat_diff_time_cyto_ctrl : data-frame with different AS analysis for the cytoplasmic compartment
# tab_psi                 : data frame with all events in VAST-TOOLS and their PSI values in each sample 

#Here I make some nice colors to facilitate the visualisation
mygroup                <- factor(as.character(info$DIV),levels=c(0,3,7,14,22,35))
mycols_days            <- c("#CCFF00","#33CC33","#669999","#6699FF","#3300FF","#CC33CC")
names(mycols_days)     <- c(0,3,7,14,22,35)
mycols                 <- unlist(lapply(info$DIV,function(Z)return(mycols_days[match(as.character(Z),names(mycols_days))])))

2 Some useful functions for your analysis

#Biological Pathway Gene Enrichment Analysis
GO_analysis <- function(genes_list){
  gostres_diff <- gost(query = genes_list, 
                  organism = "hsapiens", ordered_query = FALSE, 
                  multi_query = FALSE, significant = TRUE, exclude_iea = FALSE, 
                  measure_underrepresentation = FALSE, evcodes = TRUE, 
                  user_threshold = 0.05, correction_method = "g_SCS", 
                  domain_scope = "annotated", custom_bg = NULL, 
                  numeric_ns = "", as_short_link = FALSE,sources=c("GO:BP", "GO:MF","KEGG"))
  gostplot(gostres_diff, capped = FALSE, interactive = TRUE)#please note this is going to create an interactive plot
}

#Create a venn diagram from two gene lists 
## This is an example - not meant to run
vennDiag <- function(genes_lists){
  genes_comparisons <- do.call(what=cbind,args=lapply(genes_lists,function(Z)return(rownames(myE_gen)%in%Z)))
  colnames(genes_comparisons)<-c("cond1","cond2")
  vennDiagram(genes_comparisons,main="blasdfas")
}

#Plot PSI events over time in nuclear cyto fractions
plotEventOverTime <- function(event="HsaINT0002504"){
  toi     <- c("E.dPsi.d_3_0","E.dPsi.d_7_0","E.dPsi.d_14_0","E.dPsi.d_22_0","E.dPsi.d_35_0")
  dat_nuc <- dat_diff_time_nuc_ctrl[match(event,dat_diff_time_nuc_ctrl$EVENT),match(toi,colnames(dat_diff_time_nuc_ctrl))]
  dat_cyto<- dat_diff_time_cyto_ctrl[match(event,dat_diff_time_cyto_ctrl$EVENT),match(toi,colnames(dat_diff_time_cyto_ctrl))]
  tempdat <- rbind(as.numeric(dat_nuc),as.numeric(dat_cyto))
  rownames(tempdat)<-c("nuc","cyto")
  colnames(tempdat)<- c("3","7","14","22","35")
  tempdat[is.na(tempdat)]<-0
  barplot(tempdat,beside=TRUE,las=1,ylab="dPSI",xlab="DIV",main=paste(tab_psi$GENE[match(event,tab_psi$EVENT)], "-", tab_psi$COORD[match(event,tab_psi$EVENT)]),col=c("#CC9C3D","#476DB4"))
  legend("topleft",ncol=1,pch=15,col=c("#CC9C3D","#476DB4"),leg=c("nuc","cyto"),cex=0.7)
  grid()
}

#To following function sort myvec in decreasing order and return their indexes:
myidx<- sort(x=c(1:2000),index.return=TRUE,decreasing=TRUE)$ix

3 Alternative splicing analysis

Let’s first analyse the relative number of alternative splicing (AS) events exhibiting significant changes between the iPSC stage (DIV=0) and the NPC stage (DIV=14), and compare those between nuclear (dat_diff_time_nuc_ctrl data-table) and cytoplasmic fractions (dat_diff_time_cyto_ctrl). The schematic below shows the four main types of events we are going to look at.
Schematic depicting the four different types of alternative splicing events we are going to study. Alternative 5’ start site (Alt5); Alternative exon (AltEx); Alternative 3’ end side (Alt3); intron retention (IR)
Schematic depicting the four different types of alternative splicing events we are going to study. Alternative 5’ start site (Alt5); Alternative exon (AltEx); Alternative 3’ end side (Alt3); intron retention (IR)

Here is a summary of the column content which you can find at on Github repo of VAST-TOOL:

3.1 Get familiar with your data

First you always need to make sure you understand the format of your data, the content of the columns and the rows. Start by checking the dimensions of the data using dim, nrow, ncol functions. Then if the data-count table is not too large, you can use the View function to visualise and explore its content. Alternatively you can print a couple of rows. Here you need to understand the output from VAST-TOOLS.

dim(tab_psi)
## [1] 365711    102
print(colnames(tab_psi))
##   [1] "GENE"          "EVENT"         "COORD"         "LENGTH"       
##   [5] "FullCO"        "COMPLEX"       "0_CB1D_cyto"   "0_CB1D_nuc"   
##   [9] "0_CB1E_cyto"   "0_CB1E_nuc"    "0_CTRL1_cyto"  "0_CTRL1_nuc"  
##  [13] "0_CTRL3_cyto"  "0_CTRL3_nuc"   "0_CTRL4_cyto"  "0_CTRL4_nuc"  
##  [17] "0_CTRL5_cyto"  "0_CTRL5_nuc"   "0_GliA_cyto"   "0_GliA_nuc"   
##  [21] "0_GliB_cyto"   "14_CB1D_cyto"  "14_CB1D_nuc"   "14_CB1E_cyto" 
##  [25] "14_CB1E_nuc"   "14_CTRL1_cyto" "14_CTRL1_nuc"  "14_CTRL3_cyto"
##  [29] "14_CTRL3_nuc"  "14_CTRL4_cyto" "14_CTRL4_nuc"  "14_CTRL5_cyto"
##  [33] "14_CTRL5_nuc"  "14_GliA_cyto"  "14_GliA_nuc"   "14_GliB_cyto" 
##  [37] "14_GliB_nuc"   "22_CB1D_cyto"  "22_CB1D_nuc"   "22_CB1E_cyto" 
##  [41] "22_CB1E_nuc"   "22_CTRL1_cyto" "22_CTRL1_nuc"  "22_CTRL3_cyto"
##  [45] "22_CTRL3_nuc"  "22_CTRL4_cyto" "22_CTRL4_nuc"  "22_CTRL5_cyto"
##  [49] "22_CTRL5_nuc"  "22_GliA_cyto"  "22_GliA_nuc"   "22_GliB_cyto" 
##  [53] "22_GliB_nuc"   "35_CB1D_cyto"  "35_CB1D_nuc"   "35_CB1E_cyto" 
##  [57] "35_CB1E_nuc"   "35_CTRL1_cyto" "35_CTRL1_nuc"  "35_CTRL3_cyto"
##  [61] "35_CTRL3_nuc"  "35_CTRL4_cyto" "35_CTRL4_nuc"  "35_CTRL5_cyto"
##  [65] "35_CTRL5_nuc"  "35_GliA_cyto"  "35_GliA_nuc"   "35_GliB_cyto" 
##  [69] "35_GliB_nuc"   "3_CB1D_cyto"   "3_CB1D_nuc"    "3_CB1E_cyto"  
##  [73] "3_CB1E_nuc"    "3_CTRL1_cyto"  "3_CTRL1_nuc"   "3_CTRL3_cyto" 
##  [77] "3_CTRL3_nuc"   "3_CTRL4_cyto"  "3_CTRL4_nuc"   "3_CTRL5_cyto" 
##  [81] "3_CTRL5_nuc"   "3_GliA_cyto"   "3_GliA_nuc"    "3_GliB_cyto"  
##  [85] "3_GliB_nuc"    "7_CB1D_cyto"   "7_CB1D_nuc"    "7_CB1E_cyto"  
##  [89] "7_CB1E_nuc"    "7_CTRL1_cyto"  "7_CTRL1_nuc"   "7_CTRL3_cyto" 
##  [93] "7_CTRL3_nuc"   "7_CTRL4_cyto"  "7_CTRL4_nuc"   "7_CTRL5_cyto" 
##  [97] "7_CTRL5_nuc"   "7_GliA_cyto"   "7_GliA_nuc"    "7_GliB_cyto"  
## [101] "7_GliB_nuc"    "COMPLEX_short"
dim(dat_diff_time_nuc_ctrl)
## [1] 365711     21
print(colnames(dat_diff_time_cyto_ctrl))
##  [1] "GENE"            "EVENT"           "COORD"           "LENGTH"         
##  [5] "FullCO"          "COMPLEX"         "E.dPsi.d_3_0"    "MV.dPsi.d_3_0"  
##  [9] "E.dPsi.d_7_0"    "MV.dPsi.d_7_0"   "E.dPsi.d_7_3"    "MV.dPsi.d_7_3"  
## [13] "E.dPsi.d_14_0"   "MV.dPsi.d_14_0"  "E.dPsi.d_22_0"   "MV.dPsi.d_22_0" 
## [17] "E.dPsi.d_35_0"   "MV.dPsi.d_35_0"  "E.dPsi.d_35_22"  "MV.dPsi.d_35_22"
## [21] "COMPLEX_short"
dim(dat_diff_time_cyto_ctrl)
## [1] 365711     21
print(colnames(dat_diff_time_nuc_ctrl))
##  [1] "GENE"            "EVENT"           "COORD"           "LENGTH"         
##  [5] "FullCO"          "COMPLEX"         "E.dPsi.d_3_0"    "MV.dPsi.d_3_0"  
##  [9] "E.dPsi.d_7_0"    "MV.dPsi.d_7_0"   "E.dPsi.d_7_3"    "MV.dPsi.d_7_3"  
## [13] "E.dPsi.d_14_0"   "MV.dPsi.d_14_0"  "E.dPsi.d_22_0"   "MV.dPsi.d_22_0" 
## [17] "E.dPsi.d_35_0"   "MV.dPsi.d_35_0"  "E.dPsi.d_35_22"  "MV.dPsi.d_35_22"
## [21] "COMPLEX_short"

3.2 Overview of the relative number of changes of specific types between nucleus and cytoplasm

It is generally good to get an understanding of the type of events which are differentially spliced between two conditions. Pie-charts can be useful to visualise the relative abundances of events in each categories between different types of conditions. \(\Delta\)PSI>0 indicates inclusion while \(\Delta\)PSI<0 indicates skipping between two conditions. The default threshold of a difference in PSI is 15%.

Let’s start by looking at the relative number of events with a threshold difference of 20% between the iPSC stage (DIV=0) and the MN stage (DIV=35) in the nucleus:

col_events=c("#CD4599","#8C8C8C","#6ABD45","#4B5493","#FEC20F")
names(col_events) <- names(table(dat_diff_time_nuc_ctrl$COMPLEX_short))
par(mfrow=c(1,3),mar=c(1,1,2,1))
pie(table(dat_diff_time_nuc_ctrl$COMPLEX_short),col=col_events)
mtext(side=3,line=-1,text="All annotated events",cex=0.7)
pie(table(dat_diff_time_nuc_ctrl$COMPLEX_short[dat_diff_time_nuc_ctrl$E.dPsi.d_35_0>0.2]),col=col_events)
mtext(side=3,line=0,text="NUCLEUS -- DIV=35",cex=0.7)
mtext(side=3,line=-1,text="dPSI>20%",cex=0.7)
pie(table(dat_diff_time_nuc_ctrl$COMPLEX_short[dat_diff_time_nuc_ctrl$E.dPsi.d_35_0<(-0.2)]),col=col_events)
mtext(side=3,line=-1,text="dPSI<(-20%)",cex=0.7)

And compare these with event distribution in cytoplasmic fractions:

par(mfrow=c(1,3),mar=c(1,1,2,1))
pie(table(dat_diff_time_cyto_ctrl$COMPLEX_short),col=col_events)
mtext(side=3,line=-1,text="All annotated events",cex=0.7)
pie(table(dat_diff_time_cyto_ctrl$COMPLEX_short[dat_diff_time_cyto_ctrl$E.dPsi.d_35_0>0.2]),col=col_events)
mtext(side=3,line=0,text="CYTOPLASM --DIV=35",cex=0.7)
mtext(side=3,line=-1,text="dPSI>20%",cex=0.7)
pie(table(dat_diff_time_cyto_ctrl$COMPLEX_short[dat_diff_time_cyto_ctrl$E.dPsi.d_35_0<(-0.2)]),col=col_events)
mtext(side=3,line=-1,text="dPSI<(-20%)",cex=0.7)

What are the key observations?

3.3 Comparisons of the AS changes detected in nucleus and cytoplasm over time

Having found some differences at MN stages in relative distributions of included events between cyto and nuc (enrichment in nuclear fraction of IR), let’s now test whether splicing patterns are similar in nucleus versus cytoplasm over time.

pcc_events<- do.call(what=cbind,args=lapply(unique(dat_diff_time_cyto_ctrl$COMPLEX_short),function(EV)return(unlist(lapply(c("E.dPsi.d_3_0",   "E.dPsi.d_7_0" , "E.dPsi.d_14_0" , "E.dPsi.d_22_0" , "E.dPsi.d_35_0"),function(coln)return(cor(dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$COMPLEX_short==EV,match(coln,colnames(dat_diff_time_nuc_ctrl))],dat_diff_time_cyto_ctrl[dat_diff_time_cyto_ctrl$COMPLEX_short==EV,match(coln,colnames(dat_diff_time_cyto_ctrl))],use = "complete")))))))
colnames(pcc_events)<- unique(dat_diff_time_cyto_ctrl$COMPLEX_short)
rownames(pcc_events)<-c("E.dPsi.d_3_0",   "E.dPsi.d_7_0" , "E.dPsi.d_14_0" , "E.dPsi.d_22_0" , "E.dPsi.d_35_0") 

par(mfrow=c(1,1))
barplot(pcc_events,beside=TRUE,col=unlist(lapply(colnames(pcc_events),function(Z)return(rep(col_events[match(Z,names(col_events))],5)))),las=1,ylim=c(0,1),ylab="Pearson Correlation Coefficient")

Let’s now visualise these results, in particular the changes between NPC and MN stages in terms of alternative exon usage and intron retention:

par(mfrow=c(2,2),mar=c(4,4,4,4))
plot(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Intron Retention",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=14]",cex=0.7)
grid()
abline(0,1,col="red",lty=2)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],use = "complete"),digit=2)),cex=0.7)

plot(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Alternative Exon",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=14]",cex=0.7)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],use = "complete"),digit=2)),cex=0.7)
grid()
abline(0,1,col="red",lty=2)

plot(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Intron Retention",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=35]",cex=0.7)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],use = "complete"),digit=2)),cex=0.7)
grid()
abline(0,1,col="red",lty=2)

plot(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Alternative Exon",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=35]",cex=0.7)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],use = "complete"),digit=2)),cex=0.7)
grid()
abline(0,1,col="red",lty=2)

If interested have a look at the role of nuclear detention of intron-retaining transcripts.

4 Singular value decomposition (SVD)

While hierarchical clustering might be dominated by some changes in few genes, other methods such as singular value decomposition (SVD) analysis can be instrumental in identifying independent biological pathways underlying changes in gene expression. It is also very appropriate for time-series analysis when variation is not linearly dependent on time.

4.1 Relevant functions for SVD analysis

#This function extract the fraction of variance per component as derived from SVD analysis
#Input= SVD object 
getFractionVariance<- function(mySVD){
  return(mySVD$d*mySVD$d/sum(mySVD$d*mySVD$d))
}

#This function calculate the Shannon Entropy which indicates how well the information is spread among the principal components.
#High value indicates information evenly distributed among components
#Low value indicates most information/variance is captured by a single component
#Takes as input the fraction of variance outputted from getFractionVariance
getShannonEntropy <- function(pip)return(-1*sum(pip*log10(pip))/log10(length(pip)))

#This function plots the fraction of variance captured by first N components.
#pi=fraction of variance as obtained from getFractionVariance
#dp=Shannon Entropy as obtained from getShannonEntropy
#N=number of components to plot
ScrePlot <- function(pi,dp,N=NA){
  if(is.na(N)){
  barplot(pi,las=1,cex.main=0.7,cex.axis=1.0,col="black")
  }
  else{
    barplot(pi[c(1:N)],las=1,cex.main=0.7,cex.axis=1.0,col="black")
  }
  mtext(side = 1, line = 2, text ="principal components", col = "black",cex=0.7, font=1)
  mtext(side = 2, line = 2, text ="Fraction of explained variance", col = "black",cex=0.7, font=1)
  mtext(side = 3, line = -2, text = paste("Shannon Entropy = ",round(dp,digits=3)), col = "black",cex=0.7, font=1)
}

4.2 Prepare relevant matrices

For this analysis we will focus on the nuclear samples and use the different types of data collected so far i.e.Ā gene expression, PUD, IR and AltEx. A similar analysis can be done on the cytoplasmic fraction.

ge                <- myE_gen
EX_nuc            <- as.matrix(tab_psi[which(tab_psi$COMPLEX_short=="EX"),match(colnames(myE_gen),colnames(tab_psi))])
IR_nuc            <- as.matrix(tab_psi[which(tab_psi$COMPLEX_short=="IR"),match(colnames(myE_gen),colnames(tab_psi))])

rownames(EX_nuc)  <- as.character(tab_psi$EVENT)[which(tab_psi$COMPLEX_short=="EX")]
rownames(IR_nuc)  <- as.character(tab_psi$EVENT)[which(tab_psi$COMPLEX_short=="IR")]

#You need to remove NA's and NaN for the unsupervised analysis
tosel_ex          <- which(apply(is.na(EX_nuc),1,sum)==0&apply(is.nan(EX_nuc),1,sum)==0)
tosel_ir          <- which(apply(is.na(IR_nuc),1,sum)==0&apply(is.nan(IR_nuc),1,sum)==0)
EX_nuc            <- EX_nuc[tosel_ex,]
IR_nuc            <- IR_nuc[tosel_ir,]

4.3 Systematic removal of the first component: effect on GE matrix

It is common in SVD analysis to remove the first component which captures noise == most variations from from systematic changes in basal gene expression between genes.

#SVD on the Gene Expression
dat1         <- ge
SVD_eset     <- svd(dat1)
pi_1         <- getFractionVariance(mySVD=SVD_eset)
dp_1         <- getShannonEntropy(pi_1)


datm         <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_GE  <- svd(datm)
pi_2         <- getFractionVariance(mySVD=SVD_eset_GE)
dp_2         <- getShannonEntropy(pi_2)


par(mfrow=c(2,4))
ScrePlot(pi_1,dp_1,N=NA)
mtext(side=3,line=0,text="prior removing first component")
#Effect on the sample distribution
multidensity(dat1[,c(1,10,15,20)],main="prior filtering",las=1)
boxplot(dat1,outline=FALSE)
boxplot(t(dat1[c(1,4,10,44,69,1000,2000,4000),]),outline=FALSE,las=2,ylab="log2(count)")

ScrePlot(pi_2,dp_2,N=NA)
mtext(side=3,line=0,text="after removing first component")
multidensity(datm[,c(1,10,15,20)],main="after filtering",las=1)
boxplot(datm,outline=FALSE)
#Effect on the genes distribution
boxplot(t(datm[c(1,4,10,44,69,1000,2000,4000),]),outline=FALSE,las=2,ylab="log2(count)")

4.4 Perform SVD on individual matrices (AltEx, IR, and GE)

We start by performing SVD on individual matrices and compare how information is captured in the different matrices.

layout(matrix(c(1:6),ncol=3,nrow=2,byrow=FALSE))

#1. SVD on the Gene Expression
dat1         <- ge
SVD_eset     <- svd(dat1)
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="Gene Expression")
datm         <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_GE  <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_GE),getShannonEntropy(getFractionVariance(SVD_eset_GE)))
mtext(side=3,line=0,text="Gene Expression")
#2. SVD on the AltEx
dat1         <- EX_nuc
SVD_eset     <- svd(dat1)
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="AltEx")
mtext(side=3,line=1,text="Prior removing first component")
datm         <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_Ex  <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_Ex),getShannonEntropy(getFractionVariance(SVD_eset_Ex)))
mtext(side=3,line=1,text="After removing first component")
mtext(side=3,line=0,text="AltEx")
#3. SVD on the IR
dat1         <- IR_nuc
SVD_eset     <- svd(dat1)
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="IR")
datm         <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_IR  <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_IR),getShannonEntropy(getFractionVariance(SVD_eset_IR)))
mtext(side=3,line=0,text="IR")

Let’s now compare their first PCs to test who the time is captured by the different matrices in the different SVD.

#PCA plots
par(mfrow=c(3,3),mar=c(4,4,2,2))
plot(SVD_eset_GE$v[,1], SVD_eset_GE$v[,2],col=mycols,xlab="PC1",ylab="PC2",main="Gene Expression",pch=19,cex=1.5,las=1)
plot(SVD_eset_Ex$v[,1], SVD_eset_Ex$v[,2],col=mycols,xlab="PC1",ylab="PC2",main="AltEx",pch=19,cex=1.5,las=1)
plot(SVD_eset_IR$v[,1], SVD_eset_IR$v[,2],col=mycols,xlab="PC1",ylab="PC2",main="IR",pch=19,cex=1.5,las=1)


plot(SVD_eset_GE$v[,1], SVD_eset_GE$v[,3],col=mycols,xlab="PC1",ylab="PC3",main="Gene Expression",pch=19,cex=1.5,las=1)
plot(SVD_eset_Ex$v[,1], SVD_eset_Ex$v[,3],col=mycols,xlab="PC1",ylab="PC3",main="AltEx",pch=19,cex=1.5,las=1)
plot(SVD_eset_IR$v[,1], SVD_eset_IR$v[,3],col=mycols,xlab="PC1",ylab="PC3",main="IR",pch=19,cex=1.5,las=1)


plot(SVD_eset_GE$v[,2], SVD_eset_GE$v[,3],col=mycols,xlab="PC2",ylab="PC3",main="Gene Expression",pch=19,cex=1.5,las=1)
plot(SVD_eset_Ex$v[,2], SVD_eset_Ex$v[,3],col=mycols,xlab="PC2",ylab="PC3",main="AltEx",pch=19,cex=1.5,las=1)
plot(SVD_eset_IR$v[,2], SVD_eset_IR$v[,3],col=mycols,xlab="PC2",ylab="PC3",main="IR",pch=19,cex=1.5,las=1)

Do you see any differences in the clustering of the samples using the different count matrices?

4.5 The dynamic of the components over time

Let’s now look into the dynamic over time of the different component:

myMean <- list( t(apply(SVD_eset_GE$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=mean)))),
                t(apply(SVD_eset_Ex$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=mean)))),
                t(apply(SVD_eset_IR$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=mean))))               )
mySD  <- list(t(apply(SVD_eset_GE$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=sd)))),
              t(apply(SVD_eset_Ex$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=sd)))),
              t(apply(SVD_eset_IR$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=sd))))              )

#Plot component over time
days                <- c(0,3,7,14,21,35)
cats                <- c("Gene Expression","AltEx","IR","3UTR")
CEX<- 0.7
par(mfrow=c(3,3),mar=c(5,5,2,0),oma=c(2,2,2,2))
for(j in c(1:3)){
  for(i in c(1:3)){
    MIN=min(0,min(myMean[[j]][i,]-mySD[[j]][i,]))
    MAX=max(myMean[[j]][i,]+mySD[[j]][i,])
    plot(days,myMean[[j]][i,],pch=19,type="b",lty=2,ylim=c(MIN,MAX),las=1,frame="F",xlab="time [days]",cex=1.0,cex.axis=CEX,cex.lab=CEX,ylab="")
    mtext(side=2,line=3,text=paste("PC",i,sep=""),cex=CEX)
    mtext(side=3,line=0,text=cats[j],cex=CEX)
    grid()
    abline(h=0,col="red",lty=2)
  }
}

5 Graded homework

Your homework will be graded according to three criteria: 1) Correctness of your result; 2) Clarity of the visual output; 3) Description of your results demonstrating your ability to discuss your result in their biological context.

5.2 Taks 2. Hierarchical clustering analysis (0.25 pt)

Use hierarchical clustering analysis to test whether the time component is differentially captured by changes in gene expression, AltEx, and intron retention the nuclear fractions.

We will use clustering method based on the Manhattan distance and Ward D algorithm:

CEX=1.0
hcl_raw                          <- hclust(dist(t(ge),method="man"), method = "ward.D", members = NULL)
hcl_ex_nuc                     <- hclust(dist(t(EX_nuc),method="man"), method = "ward.D", members = NULL)
hcl_ir_nuc                       <- hclust(dist(t(IR_nuc),method="man"), method = "ward.D", members = NULL)

mytime                 <- factor(as.character(info$DIV),levels=c(0,3,7,14,22,35))
mycols_days            <- c("#CCFF00","#33CC33","#669999","#6699FF","#3300FF","#CC33CC")
names(mycols_days)     <- c(0,3,7,14,22,35)
mycols                 <- unlist(lapply(info$DIV,function(Z)return(mycols_days[match(as.character(Z),names(mycols_days))])))

par(mfrow=c(1,3), mar=c(1,0,3,1))
plot(ape::as.phylo(hcl_raw),tip.color=mycols,cex=CEX,label.offset = 0.001,no.margin = TRUE,use.edge.length=TRUE,direction="rightwards",plot=TRUE,font=1,main="Gene expression")
plot(ape::as.phylo(hcl_ex_nuc),tip.color=mycols,cex=CEX,label.offset = 0.001,no.margin = TRUE,use.edge.length=TRUE,direction="rightwards",plot=TRUE,font=1,main="AltEx")
plot(ape::as.phylo(hcl_ir_nuc),tip.color=mycols,cex=CEX,label.offset = 0.001,no.margin = TRUE,use.edge.length=TRUE,direction="rightwards",plot=TRUE,font=1,main="Intron retention")

From the plots above, we can conclude that time component is differentially captured by changes in gene expression, since control samples that belong to the same time points are clustered together.

5.3 Task 3. SVD analysis (0.25 pt)

  • Extract the top 500 most positively and 500 most negatively contributing genes to the first 5 components of SVD from IR.
column_names = colnames(IR_nuc)
row_names = rownames(IR_nuc)
# We perform SVD analyisis on IR_nuc table 
dat1 = IR_nuc
SVD_eset = svd(dat1)

# It is common in SVD analysis to remove the first component that captures noise 
datm = dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_IR = svd(datm)

# We use left-singular matrix
u = SVD_eset_IR$u

# We extract first 5 PC values
PC1 = u[,1]
PC2 = u[,2]
PC3 = u[,3]
PC4 = u[,4]
PC5 = u[,5]

# We need to sort them 
PC1_sorted = sort(x=PC1,index.return=TRUE,decreasing=TRUE)$ix
PC2_sorted = sort(x=PC2,index.return=TRUE,decreasing=TRUE)$ix
PC3_sorted = sort(x=PC3,index.return=TRUE,decreasing=TRUE)$ix
PC4_sorted = sort(x=PC4,index.return=TRUE,decreasing=TRUE)$ix
PC5_sorted = sort(x=PC5,index.return=TRUE,decreasing=TRUE)$ix

# We extract top 500 positively and negatively contributing genes for each PC
# PC1
top500_p_pc1 = PC1_sorted[1:500]
top500_n_pc1 = tail(PC1_sorted, 500)

top500_p_events_pc1 = IR_nuc[top500_p_pc1, ]
top500_n_events_pc1 = IR_nuc[top500_n_pc1, ]

# PC2
top500_p_pc2 = PC2_sorted[1:500]
top500_n_pc2 = tail(PC2_sorted, 500)

top500_p_events_pc2 = IR_nuc[top500_p_pc2, ]
top500_n_events_pc2 = IR_nuc[top500_n_pc2, ]

# PC3
top500_p_pc3 = PC3_sorted[1:500]
top500_n_pc3 = tail(PC3_sorted, 500)

top500_p_events_pc3 = IR_nuc[top500_p_pc3, ]
top500_n_events_pc3 = IR_nuc[top500_n_pc3, ]

# PC4
top500_p_pc4 = PC4_sorted[1:500]
top500_n_pc4 = tail(PC4_sorted, 500)

top500_p_events_pc4 = IR_nuc[top500_p_pc4, ]
top500_n_events_pc4 = IR_nuc[top500_n_pc4, ]

# PC5
top500_p_pc5 = PC5_sorted[1:500]
top500_n_pc5 = tail(PC5_sorted, 500)

top500_p_events_pc5 = IR_nuc[top500_p_pc5, ]
top500_n_events_pc5 = IR_nuc[top500_n_pc5, ]


par(mfrow=c(2,1))
boxplot(top500_p_events_pc1, main="PSI values over time for genes contributing positively to PC1", ylab="PSI")
boxplot(top500_n_events_pc1, main="PSI values over time for genes contributing negatively to PC1", ylab="PSI")

par(mfrow=c(2,1))
boxplot(top500_p_events_pc2, main="PSI values over time for genes contributing positively to PC2", ylab="PSI")
boxplot(top500_n_events_pc2, main="PSI values over time for genes contributing negatively to PC2", ylab="PSI")

par(mfrow=c(2,1))
boxplot(top500_p_events_pc3, main="PSI values over time for genes contributing positively to PC3", ylab="PSI")
boxplot(top500_n_events_pc3, main="PSI values over time for genes contributing negatively to PC3", ylab="PSI")

par(mfrow=c(2,1))
boxplot(top500_p_events_pc4, main="PSI values over time for genes contributing positively to PC4", ylab="PSI")
boxplot(top500_n_events_pc4, main="PSI values over time for genes contributing negatively to PC4", ylab="PSI")

par(mfrow=c(2,1))
boxplot(top500_p_events_pc5, main="PSI values over time for genes contributing positively to PC5", ylab="PSI")
boxplot(top500_n_events_pc5, main="PSI values over time for genes contributing negatively to PC5", ylab="PSI")

Hint: U left-singular matrix provides with loadings of the genes/transcripts in individual components.

5.4 Task 4 (Bonus). Enrichment of relevant PCi (0.25 pt)

Perform BP enrichment analysis on relevant PCs as identified in previous point.

# We need to extract the events that correspond to the top 500 in each PCi
# PC1
events_pc1_p = row_names[top500_p_pc1]
events_pc1_n = row_names[top500_n_pc1]

genes_pc1_p = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc1_p, ]$GENE
genes_pc1_n = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc1_n, ]$GENE

par(mfrow=c(1,1))
# Positively contributing genes
GO_analysis(genes_pc1_p)
# Negatively contributing genes
par(mfrow=c(1,1))
GO_analysis(genes_pc1_n)

From the 500 most positively contributing genes to the PC1 in terms of BP enrichment we have the ones related to the pathways of transport and cellular localization. For negatively contributing genes (PC1), BP enriched pathways are organelle localization, cytoskeleton organization, cell projection organization, plasma membrane bounded cell projection organization.

# PC2
events_pc2_p = row_names[top500_p_pc2]
events_pc2_n = row_names[top500_n_pc2]

genes_pc2_p = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc2_p, ]$GENE
genes_pc2_n = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc2_n, ]$GENE

# Positively contributing genes
par(mfrow=c(1,1))
GO_analysis(genes_pc2_p)
# Negatively contributing genes
par(mfrow=c(1,1))
GO_analysis(genes_pc2_n)

From the 500 most positively contributing genes to the PC2 in terms of BP enrichment we have the ones related to the pathways of organelle organization and transport. For negatively contributing genes (PC2), BP enriched pathways are organelle localization, regulation of nitrogen compound metabolic process, regulation of primary metabolic processes.

# PC3
events_pc3_p = row_names[top500_p_pc3]
events_pc3_n = row_names[top500_n_pc3]

genes_pc3_p = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc3_p, ]$GENE
genes_pc3_n = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc3_n, ]$GENE

# Positively contributing genes
par(mfrow=c(1,1))
GO_analysis(genes_pc3_p)
# Negatively contributing genes
par(mfrow=c(1,1))
GO_analysis(genes_pc3_n)

From the 500 most positively contributing genes to the PC3 in terms of BP enrichment we have the ones related to the pathways of nervous system development, multicellular organism development and system development. For negatively contributing genes (PC3), BP enriched pathways are vesicle-mediated transport, actin cytoskeleton organization and actin filament-based process.

# PC4
events_pc4_p = row_names[top500_p_pc4]
events_pc4_n = row_names[top500_n_pc4]

genes_pc4_p = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc4_p, ]$GENE
genes_pc4_n = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc4_n, ]$GENE

# Positively contributing genes
par(mfrow=c(1,1))
GO_analysis(genes_pc4_p)
# Negatively contributing genes
par(mfrow=c(1,1))
GO_analysis(genes_pc4_n)

From the 500 most positively contributing genes to the PC4 in terms of BP enrichment we have the ones related to the pathways of regulation of cellular component organization and biological regulation. For negatively contributing genes (PC4), BP enriched pathways are cell morphogenesis and system development.

# PC5
events_pc5_p = row_names[top500_p_pc5]
events_pc5_n = row_names[top500_n_pc5]

genes_pc5_p = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc5_p, ]$GENE
genes_pc5_n = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc5_n, ]$GENE

# Positively contributing genes
par(mfrow=c(1,1))
GO_analysis(genes_pc5_p)
# Negatively contributing genes
par(mfrow=c(1,1))
GO_analysis(genes_pc5_n)

From the 500 most positively contributing genes to the PC5 in terms of BP enrichment we have the ones related to the pathways of cellular response to stress and biological regulation. For negatively contributing genes (PC5), BP enriched pathways are regulation of primary metabolic process and regulation of nitrogen compound metabolic process.